Case Study 3: Fairness Testing for Algorithmic Pricing

A practical framework for fair algorithmic pricing

Author

Fei Huang

Published

June 11, 2026

Before you start

Read Step 4: Audit the system first for the qualitative audit protocol. This case study implements CDP and proxy-discrimination testing on Illinois auto insurance quotes.

Background

Regulatory agencies have increasingly required firms to assess whether algorithmic pricing systems produce unfair outcomes for protected groups. For example, recent regulatory proposals in Colorado and New York (Colorado Division of Insurance 2023; New York State Department of Financial Services 2024) require insurers to evaluate whether pricing algorithms and predictive models produce discriminatory pricing outcomes.

Many fairness audits rely on regression-based methods. A common approach is to compare pricing outcomes across protected groups after controlling for legitimate risk factors and then evaluate whether any remaining disparity falls within an acceptable tolerance.

This case study illustrates how regression-based fairness audits can be conducted for deterministic pricing outputs. It is based on Huang and Hooker (2026). In particular, we focus on two fairness criteria that are closely related to current regulatory proposals:

Conditional demographic parity (CDP). after controlling for legitimate risk factors, do pricing outcomes differ systematically across protected groups?
Proxy discrimination (PD). does a rating variable act as a statistical proxy for a protected attribute?

A distinctive feature of algorithmic pricing systems is that they are often deterministic. The same profile submitted to the same algorithm produces the same price. As a result, residuals from an audit regression represent approximation error rather than sampling variability. This distinction has important implications for statistical inference and forms the central motivation for the empirical analysis in this case study.

To illustrate the audit framework, we use the ProPublica Illinois auto insurance dataset (Larson et al. 2017), which contains quoted premiums from multiple insurers across Illinois zip codes. Download the public data release, extract car-insurance-public/data/il-per-zip.csv, and place it in this case study folder before running the code. See Data for details.

The analysis is designed to show how to run a regression-based fairness audit, not to deliver filing-ready compliance conclusions for any named insurer.

How to Read This Case Study

Not every section is needed for every reader. The table below suggests a practical reading order.

If your focus is…	Start here	Then read
Audit workflow and regulatory logic	Audit protocol	CDP audit → PD audit
What the Illinois results show	Data preparation	CDP audit and PD audit summary tables and figures
Implementation detail	Notation → Case Study Setup	Supporting Materials and code chunks (folded by default)

Audit protocol

This case study applies the Plan → Audit → Decide → Improve workflow from Step 4: Audit the system. Design choices are fixed before examining outcomes. The sections below show how that protocol is implemented on Illinois auto quotes.

Audit flow: Plan, Audit, Decide, Improve with Pass, Insufficient Information, or Fail

For this illustration we use CDP for system-level price gaps (with TOST) and PD for the suspected proxy variable log_risk. Protected attribute: majority-minority zip-code indicator. Legitimate factors: log state risk and a Chicago indicator. Tolerances and decision rules are in Audit parameters. Pass, fail, and insufficient-information outcomes follow the three-outcome framework in Step 4.

For practitioners

This case study applies one illustrative audit design from Huang and Hooker (2026) to Illinois auto quotes. In your own market you must choose:

Fairness criterion (CDP, PD, or another rule required by law)
Protected attribute A (directly observed or responsibly proxied)
Legitimate rating factors X_\ell (as recognised by your regulator)
Tolerance margins (for example, Colorado’s 5% price-gap discussion and standard 0.80 adverse-impact ratio)
Suspected proxy variables for PD screens

Quoted premiums here are deterministic algorithm outputs, so classical OLS standard errors are generally not valid. Use the corrected estimators shown below. Results are a teaching illustration using zip-level data and a proxied protected attribute, not a definitive regulatory finding about any company.

Scope of this case study

ProPublica Illinois auto quotes, company-level audits
Protected attribute: majority-minority zip-code indicator (a geographic proxy, not observed race)
Legitimate factors: state_risk, chicago
CDP via log-premium regression + TOST. PD illustration for log_risk
Corrected inference (HC3. PD cross-covariance)

The majority-minority zip indicator is a geographic proxy. For BISG or BIFSG-based regulatory designs, see Step 4: Proxied protected attributes and Xin, Hooker, and Huang (2026) on how proxy measurement distorts regression-based disparity estimates.

Not covered here. GLM audit specifications, formal sample-size planning, multi-proxy screens, BISG/BIFSG inference, or insurer-specific filing conclusions. See Huang and Hooker (2026) for the audit protocol and Xin, Hooker, and Huang (2026) for proxy-race measurement limits.

Case Study Setup

Data and audit setting

The empirical analysis is conducted at the company level. For each insurer, we observe quoted premiums across Illinois zip codes. The goal is to illustrate how regression-based fairness audits can be applied to deterministic pricing outputs.

To reproduce the results locally, place il-per-zip.csv in this folder (see Data). The code below expects DATA_PATH = "il-per-zip.csv".

The key variables are:

combined_premium, annual combined liability premium
minority, indicator equal to 1 if the zip code is majority-minority
state_risk, aggregate loss cost per insured vehicle
chicago, indicator for Chicago zip codes
companies_name, insurer name

In the analysis below, each insurer is audited separately. The protected attribute is represented by the majority-minority zip-code indicator (A_z = 1), while state_risk and chicago are used as legitimate rating factors (X_\ell). Symbols are defined in Notation.

Audit parameters

The table below maps pre-specified audit parameters to their regulatory interpretation. Values are fixed in code before running company-level tests.

Parameter	Code value	Role
Significance level	`ALPHA = 0.05`	90% confidence intervals for TOST (1 - 2\alpha)
Minimum zip codes per company	`MIN_OBS = 50`	Exclude thin company samples from company-level tests
CDP dollar tolerance	`DELTA_PCT = 0.05`	5% of mean premium (Colorado draft guidance)
CDP ratio tolerance	`TAU = 0.80`	Standard 0.80 adverse-impact ratio band
PD substantive threshold	`RHO_MIN = 0.10`	Flag only if relative coefficient shift \geq 10\%

Import packages and define audit parameters

The following code imports the required Python packages and sets the parameters above.

Show setup code

import sys
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
from IPython.display import display

DATA_PATH   = "il-per-zip.csv"
MIN_OBS     = 50
ALPHA       = 0.05
Z_CRIT      = stats.norm.ppf(1.0 - ALPHA)

# CDP TOST tolerance margins
DELTA_PCT   = 0.05
TAU         = 0.80
LOG_TAU     = np.log(1.0 / TAU)

# PD substantive shift threshold
RHO_MIN     = 0.10

Supporting functions

The following functions implement the calculations used throughout the CDP and PD audits. The code is folded for readability but can be expanded if desired.

Show supporting functions code

def ols_fit(X, y):
    """
    OLS estimator: beta_hat = (X'X)^(-1) X'y.
    """
    XtX_inv = np.linalg.inv(X.T @ X)
    beta    = XtX_inv @ (X.T @ y)
    resid   = y - X @ beta
    return beta, resid, XtX_inv


def cov_classical(resid, XtX_inv, n, p):
    """
    Classical OLS covariance: sigma^2 * (X'X)^(-1).
    """
    sigma2 = float(resid @ resid) / (n - p)
    return sigma2 * XtX_inv, sigma2


def cov_hc3(X, resid, XtX_inv):
    """
    HC3 sandwich covariance estimator.
    """
    h    = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
    w    = resid / (1.0 - h)
    meat = (X * w[:, np.newaxis]).T @ (X * w[:, np.newaxis])
    return XtX_inv @ meat @ XtX_inv


def cov_shift_full(X_res, X_ext, fz, j=1, k=1):
    """
    Correct variance of the coefficient shift for proxy discrimination.

    X_res : restricted model matrix, excluding the protected attribute.
    X_ext : extended model matrix, including the protected attribute.
    fz    : deterministic response vector.
    j, k  : column indices for the suspected proxy variable in the two models.
    """
    n = len(fz)

    XtX_inv_res = np.linalg.inv(X_res.T @ X_res)
    XtX_inv_ext = np.linalg.inv(X_ext.T @ X_ext)

    sc_res = X_res * fz[:, np.newaxis]
    sc_ext = X_ext * fz[:, np.newaxis]

    meat_res  = (sc_res.T @ sc_res) / n
    meat_res -= np.outer(sc_res.mean(axis=0), sc_res.mean(axis=0))
    sw_res    = XtX_inv_res @ meat_res @ XtX_inv_res
    var_res   = sw_res[j, j]

    meat_ext  = (sc_ext.T @ sc_ext) / n
    meat_ext -= np.outer(sc_ext.mean(axis=0), sc_ext.mean(axis=0))
    sw_ext    = XtX_inv_ext @ meat_ext @ XtX_inv_ext
    var_ext   = sw_ext[k, k]

    var_ind = var_res + var_ext

    cross_meat  = (sc_res.T @ sc_ext) / n
    cross_meat -= np.outer(sc_res.mean(axis=0), sc_ext.mean(axis=0))
    cross       = float(XtX_inv_res[j, :] @ cross_meat @ XtX_inv_ext[:, k])

    var_full = max(var_res + var_ext - 2.0 * cross, 0.0)

    return var_full, var_ind


def tost_verdict(beta_A, se_hc3, mean_premium, alpha=ALPHA,
                 delta_pct=DELTA_PCT, tau=TAU):
    """
    Three-outcome TOST decision for CDP: Pass, Fail, or
    Insufficient Information.
    """
    z = stats.norm.ppf(1.0 - alpha)
    ci_lo = beta_A - z * se_hc3
    ci_hi = beta_A + z * se_hc3

    log_tau = np.log(1.0 / tau)
    delta = delta_pct * mean_premium

    ratio_inside = (ci_lo > -log_tau) and (ci_hi < log_tau)
    ratio_outside = (ci_lo >= log_tau) or (ci_hi <= -log_tau)

    gap_lo = mean_premium * (np.exp(ci_lo) - 1.0)
    gap_hi = mean_premium * (np.exp(ci_hi) - 1.0)

    level_inside = (gap_lo > -delta) and (gap_hi < delta)
    level_outside = (gap_lo >= delta) or (gap_hi <= -delta)

    if ratio_inside and level_inside:
        verdict = "PASS"
    elif ratio_outside or level_outside:
        verdict = "FAIL"
    else:
        verdict = "INSUFFICIENT INFORMATION"

    return verdict, ci_lo, ci_hi, gap_lo, gap_hi


def pd_verdict(shift, phi_res, z_full, z_crit=Z_CRIT, rho_min=RHO_MIN):
    """
    Two-part proxy-discrimination decision rule.
    """
    rel_pct = abs(shift / phi_res) * 100.0 if phi_res != 0 else np.nan
    flag    = (abs(z_full) > z_crit) and (rel_pct >= rho_min * 100.0)
    return flag, rel_pct

Data preparation

Show data loading code

def prepare_data(path):
    df = pd.read_csv(path)

    required = {
        "companies_name", "minority", "state_risk",
        "combined_premium", "chicago"
    }
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns in CSV: {missing}")

    df = df.copy()
    df["minority_flag"] = df["minority"].astype(int)
    df["log_risk"] = np.log(df["state_risk"])
    df["log_premium"] = np.log(df["combined_premium"])
    df["excess_premium"] = df["combined_premium"] / df["state_risk"]
    df["chicago"] = df["chicago"].astype(float)

    return df


df = prepare_data(DATA_PATH)

print(f"Number of observations: {df.shape[0]}")
print(f"Number of variables: {df.shape[1]}")

df.head()

Number of observations: 31382
Number of variables: 15

	zipcode	minority	companies_name	name	bi_policy_premium	pd_policy_premium	state_risk	white_non_hisp_pct	risk_difference	combined_premium	log_risk	log_premium	excess_premium
0	60002	False	State Farm Fire & Cas Co	STATE FARM GRP	414	0.0	200.692503	88.6	213.307497	414.0	5.301774	6.025866	2.062857
1	60002	False	Metropolitan Grp Prop & Cas Ins Co	METROPOLITAN GRP	321	244.0	200.692503	88.6	364.307497	565.0	5.301774	6.336826	2.815252
2	60002	False	Progressive Direct Ins Co	PROGRESSIVE GRP	360	176.0	200.692503	88.6	335.307497	536.0	5.301774	6.284134	2.670752
3	60002	False	American Family Mut Ins Co	AMERICAN FAMILY INS GRP	458	0.0	200.692503	88.6	257.307497	458.0	5.301774	6.126869	2.282098
4	60002	False	Garrison Prop & Cas Ins Co	UNITED SERV AUTOMOBILE ASSN GRP	171	171.0	200.692503	88.6	141.307497	342.0	5.301774	5.834811	1.704100

The dataset contains zip-code-company observations. Before running company-level audits, we first summarise the unconditional differences between majority-minority and non-majority-minority zip codes.

Show summary statistics code

def table_summary_stats(df):
    non_mm = df[df["minority_flag"] == 0]
    mm = df[df["minority_flag"] == 1]

    rows = []
    for var, label, scale in [
        ("combined_premium", "Combined premium ($)", 1),
        ("state_risk", "State risk ($)", 1),
        ("excess_premium", "Excess premium ($)", 1),
        ("minority", "Pct minority (%)", 100),
    ]:
        full = df[var] * scale
        non_mm_var = non_mm[var] * scale
        mm_var = mm[var] * scale

        rows.append({
            "Variable": label,
            "Mean": round(full.mean(), 1),
            "SD": round(full.std(), 1),
            "Max": round(full.max(), 1),
            "Non-majority-minority zips": round(non_mm_var.mean(), 1),
            "Majority-minority zips": round(mm_var.mean(), 1),
            "Ratio": round(mm_var.mean() / non_mm_var.mean(), 3)
        })

    return pd.DataFrame(rows)

summary_stats = table_summary_stats(df)
display(summary_stats)

	Variable	Mean	SD	Max	Non-majority-minority zips	Majority-minority zips	Ratio
0	Combined premium ($)	370.2	147.7	1345.0	355.9	482.2	1.355
1	State risk ($)	162.7	50.9	297.5	160.6	179.2	1.116
2	Excess premium ($)	2.5	1.3	15.0	2.5	2.8	1.121
3	Pct minority (%)	11.3	31.6	100.0	0.0	100.0	inf

Note

The unconditional premium difference is not itself a fairness-audit conclusion, because majority-minority and non-majority-minority zip codes may differ in risk and geography. The regression audit below controls for state risk and a Chicago indicator.

Notation

Symbol	Meaning	Variable in this case study
P_{kz}	Quoted premium for company k in zip z	`combined_premium`
A_z	Protected attribute	`minority_flag` (1 = majority-minority zip)
X_{\ell,z}	Legitimate rating factors	`log_risk`, `chicago`
\beta_{A,k}	Conditional log-premium gap on A for company k	CDP regression coefficient
\Delta_\mu	Conditional mean premium gap	Implied from \beta_{A,k} at mean premium
R_\mu	Conditional premium ratio across groups	\exp(\beta_{A,k})
\phi, \phi'	Coefficient on suspected proxy in restricted / extended PD models	Coefficient on `log_risk`
\Delta_{PD}	Proxy coefficient shift \phi - \phi'	PD test statistic
HC3 SE	Corrected standard error for deterministic outputs	`se_hc3`
\delta	Dollar tolerance for CDP	`DELTA_PCT` \times mean premium
\tau	Adverse-impact ratio tolerance	`TAU = 0.80`
\rho_{\min}	Minimum relative PD shift to flag	`RHO_MIN = 0.10`

Rule abbreviations. CDP = conditional demographic parity. PD = proxy discrimination. TOST = two one-sided tests (equivalence testing).

Conditional Demographic Parity Audit

Audit model

The CDP audit evaluates whether pricing outcomes differ across protected groups after controlling for legitimate rating factors. For each company k, we fit the audit regression (see Notation)

\log(P_{kz}) = \mu_{0k} + \beta_{A,k} A_z + \gamma_{1k} \log(\text{state risk}_z) + \gamma_{2k} \text{Chicago}_z + r_{kz},

where P_{kz} is the quoted premium for company k in zip code z, and A_z is the majority-minority zip-code indicator. The coefficient \beta_{A,k} measures the conditional log-premium gap for company k.

Show design matrix code

def build_design_matrices(grp):
    n = len(grp)
    ones = np.ones(n)
    A = grp["minority_flag"].values.astype(float)
    lr = grp["log_risk"].values
    C = grp["chicago"].values

    X_cdp = np.column_stack([ones, A, lr, C])
    X_res = np.column_stack([ones, lr, C])
    X_ext = np.column_stack([ones, lr, C, A])
    fz = grp["log_premium"].values

    return X_cdp, X_res, X_ext, fz, n

Classical versus HC3 standard errors

The first issue is whether the usual classical OLS standard error is appropriate. Since quoted premiums are deterministic outputs, the residuals are approximation errors rather than independent stochastic noise. We therefore compare the usual classical standard error with the HC3 sandwich standard error. HC3 is a finite-sample heteroskedasticity-consistent sandwich estimator proposed by MacKinnon and White (1985) and used here as the default corrected standard error.

Show CDP analysis code

def analyse_cdp_company(company, grp):
    X_cdp, X_res, X_ext, fz, n = build_design_matrices(grp)
    p_cdp = X_cdp.shape[1]

    # CDP regression
    beta_cdp, resid_cdp, XtX_inv_cdp = ols_fit(X_cdp, fz)
    cov_cl, sigma2 = cov_classical(resid_cdp, XtX_inv_cdp, n, p_cdp)
    cov_hc = cov_hc3(X_cdp, resid_cdp, XtX_inv_cdp)

    beta_A = float(beta_cdp[1])
    se_cl = float(np.sqrt(cov_cl[1, 1]))
    se_hc3 = float(np.sqrt(cov_hc[1, 1]))
    rho = se_hc3 / se_cl

    ss_tot = float(np.sum((fz - fz.mean()) ** 2))
    r2 = 1.0 - float(resid_cdp @ resid_cdp) / ss_tot

    mean_prem_log = fz.mean()
    mean_prem_usd = float(np.exp(mean_prem_log))

    verdict, ci_lo, ci_hi, gap_lo, gap_hi = tost_verdict(
        beta_A, se_hc3, mean_prem_usd
    )

    ratio = float(np.exp(beta_A))
    gap_dollars = float(mean_prem_usd * (np.exp(beta_A) - 1.0))

    cdp_rec = {
        "company": company,
        "n": n,
        "beta_A": beta_A,
        "se_classical": se_cl,
        "se_hc3": se_hc3,
        "rho": rho,
        "r2": r2,
    }

    audit_rec = {
        "company": company,
        "gap_usd": gap_dollars,
        "ratio": ratio,
        "ci_lo": ci_lo,
        "ci_hi": ci_hi,
        "gap_lo": gap_lo,
        "gap_hi": gap_hi,
        "verdict": verdict,
    }

    return cdp_rec, audit_rec


cdp_records = []
audit_records = []

for company, grp in df.groupby("companies_name"):
    if len(grp) >= MIN_OBS:
        cdp_rec, audit_rec = analyse_cdp_company(company, grp)
        cdp_records.append(cdp_rec)
        audit_records.append(audit_rec)

cdp_results = pd.DataFrame(cdp_records).sort_values("company").reset_index(drop=True)
audit_results = pd.DataFrame(audit_records).sort_values("company").reset_index(drop=True)

print(f"Number of companies audited: {len(cdp_results)}")
cdp_results.head()

Number of companies audited: 34

	company	n	beta_A	se_classical	se_hc3	rho	r2
0	Allstate Fire & Cas Ins Co	923	0.215017	0.016442	0.018462	1.122903	0.437081
1	Allstate Ind Co	923	0.317649	0.022141	0.024226	1.094157	0.507276
2	American Family Mut Ins Co	923	0.150669	0.012237	0.010640	0.869509	0.502472
3	American Standard Ins Co of WI	923	0.136981	0.013006	0.010412	0.800583	0.430596
4	Country Mut Ins Co	923	0.260289	0.017204	0.020499	1.191543	0.349394

The next table summarises the company-level comparison between classical and HC3 standard errors.

Show CDP standard error summary code

se_summary = pd.DataFrame({
    "Number of companies": [len(cdp_results)],
    "Min HC3/classical SE ratio": [cdp_results["rho"].min()],
    "Mean HC3/classical SE ratio": [cdp_results["rho"].mean()],
    "Max HC3/classical SE ratio": [cdp_results["rho"].max()],
    "Companies with |ratio - 1| > 0.15": [
        (abs(cdp_results["rho"] - 1) > 0.15).sum()
    ]
})

display(se_summary.round(3))

	Number of companies	Min HC3/classical SE ratio	Mean HC3/classical SE ratio	Max HC3/classical SE ratio	Companies with \|ratio - 1\| > 0.15
0	34	0.685	1.065	1.775	14

The table below reports selected companies where the HC3 correction has the largest and smallest effect on the standard error for the protected-attribute coefficient.

Show selected CDP standard error results code

selected_se = pd.concat([
    cdp_results.nsmallest(3, "rho"),
    cdp_results.nlargest(3, "rho")
]).drop_duplicates(subset="company")

selected_se = selected_se[[
    "company", "n", "beta_A", "se_classical", "se_hc3", "rho", "r2"
]].copy()

selected_se = selected_se.round({
    "beta_A": 3,
    "se_classical": 3,
    "se_hc3": 3,
    "rho": 3,
    "r2": 3
})

display(selected_se)

	company	n	beta_A	se_classical	se_hc3	rho	r2
30	Trumbull Ins Co	923	0.130	0.012	0.009	0.685	0.405
26	State Farm Fire & Cas Co	923	0.222	0.017	0.013	0.753	0.450
27	State Farm Mut Auto Ins Co	923	0.222	0.017	0.013	0.753	0.450
6	Economy Preferred Ins Co	923	0.308	0.017	0.030	1.775	0.476
18	Metropolitan Cas Ins Co	923	0.308	0.017	0.029	1.679	0.482
9	Farmers Automobile Ins Assoc	923	0.294	0.024	0.040	1.660	0.675

Show CDP standard error ratio plot code

plot_df = cdp_results.sort_values("rho").reset_index(drop=True)

plt.figure(figsize=(8, 4.5))
plt.plot(range(len(plot_df)), plot_df["rho"], marker="o", linestyle="")
plt.axhline(1.0, linestyle="--", linewidth=1)
plt.xlabel("Company, sorted by SE ratio")
plt.ylabel("HC3 SE / classical SE")
plt.title("Classical standard errors can under- or over-estimate uncertainty")
plt.tight_layout()
plt.show()

Figure 1: Ratio of HC3 to classical standard errors for the protected-attribute coefficient.

Note

A ratio above 1 means the classical standard error is smaller than the HC3 standard error. A ratio below 1 means the classical standard error is larger. The direction is not known in advance. This is why the corrected standard error should be computed rather than assumed.

CDP decision rule using TOST

A conventional significance test asks whether the conditional disparity is exactly zero. However, regulatory fairness audits are typically concerned with a different question: whether any disparity is small enough to fall within a pre-specified tolerance.

To address this, we use an equivalence-testing framework known as Two One-Sided Tests (TOST) (Schuirmann 1987). Under TOST, the burden is not to show that the disparity is exactly zero, but to demonstrate that it is sufficiently small to fall within an acceptable range.

Here we use two tolerance conditions:

the confidence interval for the log-premium gap must lie within the log-ratio band corresponding to an adverse-impact ratio of 0.80. And
the implied dollar gap must lie within 5% of the mean premium.

A company passes only if both confidence intervals lie entirely within the tolerance bands. It fails if either interval lies entirely outside. Otherwise the verdict is insufficient information the sample is too imprecise to confirm compliance or a material disparity (Huang and Hooker (2026)).

Show CDP verdict summary code

cdp_summary = (
    audit_results["verdict"]
    .value_counts()
    .rename_axis("Verdict")
    .reset_index(name="Number of companies")
)

display(cdp_summary)

	Verdict	Number of companies
0	FAIL	34

Show selected CDP verdict results code

selected_cdp = pd.concat([
    audit_results.nsmallest(3, "gap_usd"),
    audit_results.nlargest(3, "gap_usd")
]).drop_duplicates(subset="company")

selected_cdp = selected_cdp[[
    "company", "gap_usd", "ratio", "ci_lo", "ci_hi", "verdict"
]].copy()

selected_cdp = selected_cdp.round({
    "gap_usd": 0,
    "ratio": 3,
    "ci_lo": 3,
    "ci_hi": 3
})

display(selected_cdp)

	company	gap_usd	ratio	ci_lo	ci_hi	verdict
31	USAA Cas Ins Co	18.0	1.095	0.075	0.105	FAIL
33	United Serv Automobile Assn	19.0	1.099	0.078	0.111	FAIL
30	Trumbull Ins Co	28.0	1.138	0.116	0.144	FAIL
1	Allstate Ind Co	231.0	1.374	0.278	0.357	FAIL
19	Metropolitan Grp Prop & Cas Ins Co	162.0	1.323	0.240	0.320	FAIL
18	Metropolitan Cas Ins Co	140.0	1.361	0.260	0.355	FAIL

Show CDP premium gap plot code

plot_df = audit_results.sort_values("gap_usd").reset_index(drop=True)

plt.figure(figsize=(8, 4.5))
plt.plot(range(len(plot_df)), plot_df["gap_usd"], marker="o", linestyle="")
plt.axhline(0.0, linestyle="--", linewidth=1)
plt.xlabel("Company, sorted by estimated gap")
plt.ylabel("Estimated dollar gap")
plt.title("Conditional premium gap for majority-minority zip codes")
plt.tight_layout()
plt.show()

Figure 2: Estimated conditional dollar premium gap across companies.

Note

Unlike conventional significance testing, TOST requires positive evidence that the disparity is sufficiently small. Insufficient information is not a pass. It means the audit cannot yet close and more data or a pre-specified escalation step may be needed.

Proxy Discrimination Audit

Coefficient-shift idea

Proxy discrimination asks whether a rating variable acts as a statistical substitute for the protected attribute. In this illustration, the suspected proxy variable is log_risk.

We compare two regressions:

\text{Restricted model: } \log(P_i) = \mu + \phi \log(\text{risk}_i) + \gamma \text{Chicago}_i + r_i,

\text{Extended model: } \log(P_i) = \mu' + \phi' \log(\text{risk}_i) + \gamma' \text{Chicago}_i + \kappa A_i + r'_i.

The coefficient shift is

\Delta_{PD} = \phi - \phi'.

If the coefficient on log_risk changes meaningfully after adding the protected attribute, this suggests that log_risk was absorbing some protected-group variation in the restricted model.

Standard versus corrected standard errors

The usual independent-samples formula treats the two coefficient estimates as independent. That is not appropriate here because both regressions are fitted to the same deterministic premium vector. The corrected standard error includes the cross-covariance between the two estimates.

Show proxy discrimination analysis code

def analyse_pd_company(company, grp):
    X_cdp, X_res, X_ext, fz, n = build_design_matrices(grp)

    # PD coefficient-shift test for log_risk
    beta_res, _, _ = ols_fit(X_res, fz)
    beta_ext, _, _ = ols_fit(X_ext, fz)

    phi_res = float(beta_res[1])
    phi_ext = float(beta_ext[1])
    shift = phi_res - phi_ext

    var_full, var_ind = cov_shift_full(X_res, X_ext, fz, j=1, k=1)
    se_full = float(np.sqrt(var_full))
    se_ind = float(np.sqrt(var_ind))

    z_ind = shift / se_ind if se_ind > 0 else np.nan
    z_full = shift / se_full if se_full > 0 else np.nan

    flag, rel_pct = pd_verdict(shift, phi_res, z_full)

    pd_rec = {
        "company": company,
        "shift": rel_pct,
        "se_ind": se_ind,
        "se_full": se_full,
        "ratio_se": se_full / se_ind if se_ind > 0 else np.nan,
        "z_ind": z_ind,
        "z_full": z_full,
        "flag": "FLAG" if flag else "",
    }

    return pd_rec


pd_records = []

for company, grp in df.groupby("companies_name"):
    if len(grp) >= MIN_OBS:
        pd_rec = analyse_pd_company(company, grp)
        pd_records.append(pd_rec)

pd_results = pd.DataFrame(pd_records).sort_values("company").reset_index(drop=True)

print(f"Number of companies audited: {len(pd_results)}")
pd_results.head()

Number of companies audited: 34

	company	shift	se_ind	se_full	ratio_se	z_ind	z_full	flag
0	Allstate Fire & Cas Ins Co	9.681743	0.026227	0.002163	0.082484	0.756896	9.176298
1	Allstate Ind Co	10.485194	0.027659	0.002327	0.084133	1.060300	12.602721	FLAG
2	American Family Mut Ins Co	7.735429	0.026078	0.002129	0.081637	0.533416	6.534027
3	American Standard Ins Co of WI	8.000026	0.028534	0.002315	0.081120	0.443219	5.463756
4	Country Mut Ins Co	10.575810	0.024418	0.002025	0.082924	0.984158	11.868223	FLAG

Show proxy discrimination summary code

pd_summary = pd.DataFrame({
    "Number of companies": [len(pd_results)],
    "Significant using independent SE": [(abs(pd_results["z_ind"]) > Z_CRIT).sum()],
    "Significant using corrected SE": [(abs(pd_results["z_full"]) > Z_CRIT).sum()],
    "Flagged using corrected two-part rule": [(pd_results["flag"] == "FLAG").sum()],
    "Min corrected/independent SE ratio": [pd_results["ratio_se"].min()],
    "Mean corrected/independent SE ratio": [pd_results["ratio_se"].mean()],
    "Max corrected/independent SE ratio": [pd_results["ratio_se"].max()]
})

pd_summary = pd_summary.round(3)
display(pd_summary)

	Number of companies	Significant using independent SE	Significant using corrected SE	Flagged using corrected two-part rule	Min corrected/independent SE ratio	Mean corrected/independent SE ratio	Max corrected/independent SE ratio
0	34	0	34	16	0.08	0.082	0.085

The table below shows selected companies with the largest relative coefficient shifts. The final flag requires both statistical significance and a relative shift of at least 10%.

Show selected proxy discrimination results code

selected_pd = pd_results.nlargest(10, "shift")[[
    "company", "shift", "se_ind", "se_full",
    "ratio_se", "z_ind", "z_full", "flag"
]].copy()

selected_pd = selected_pd.round({
    "shift": 1,
    "se_ind": 4,
    "se_full": 4,
    "ratio_se": 3,
    "z_ind": 2,
    "z_full": 2
})

display(selected_pd)

	company	shift	se_ind	se_full	ratio_se	z_ind	z_full	flag
9	Farmers Automobile Ins Assoc	22.2	0.0253	0.0022	0.085	1.07	12.59	FLAG
6	Economy Preferred Ins Co	16.5	0.0248	0.0021	0.084	1.15	13.71	FLAG
18	Metropolitan Cas Ins Co	15.5	0.0258	0.0022	0.084	1.10	13.20	FLAG
29	Travelers Home & Marine Ins Co	14.6	0.0247	0.0020	0.083	0.78	9.46	FLAG
28	Travelers Commercial Ins Co	14.6	0.0246	0.0020	0.083	0.79	9.51	FLAG
20	Metropolitan Prop & Cas Ins Co	12.5	0.0245	0.0021	0.085	1.34	15.73	FLAG
31	USAA Cas Ins Co	11.5	0.0229	0.0018	0.080	0.36	4.53	FLAG
21	Owners Ins Co	11.4	0.0255	0.0022	0.084	1.14	13.48	FLAG
33	United Serv Automobile Assn	11.1	0.0228	0.0018	0.081	0.38	4.75	FLAG
13	Geico Gen Ins Co	11.1	0.0245	0.0020	0.081	0.62	7.65	FLAG

Show proxy discrimination z-statistics plot code

plot_df = pd_results.sort_values("z_full").reset_index(drop=True)

plt.figure(figsize=(8, 4.5))
plt.plot(range(len(plot_df)), plot_df["z_ind"], marker="o", linestyle="", label="Independent SE")
plt.plot(range(len(plot_df)), plot_df["z_full"], marker="x", linestyle="", label="Corrected SE")
plt.axhline(Z_CRIT, linestyle="--", linewidth=1)
plt.axhline(-Z_CRIT, linestyle="--", linewidth=1)
plt.xlabel("Company, sorted by corrected z-statistic")
plt.ylabel("z-statistic")
plt.title("Correcting the covariance changes the proxy-discrimination test")
plt.legend()
plt.tight_layout()
plt.show()

Figure 3: Proxy-discrimination z-statistics under independent and corrected standard errors.

Note

The proxy-discrimination example shows why the dependence between the two regressions matters. If the cross-covariance is ignored, the standard error of the coefficient shift can be overstated, making the test too conservative.

Appendix: Supporting Materials

Data

Illinois quote data are from the ProPublica car insurance investigation (Larson et al. 2017). We do not redistribute the file here. To obtain it:

Download the ProPublica public data release (multi-state archive).
Extract car-insurance-public/data/il-per-zip.csv.
Save the file in this case study folder as il-per-zip.csv before running the code.

Review ProPublica’s data terms when using the download. Cite (Larson et al. 2017) in any publication that uses these data.

References

Colorado Division of Insurance. 2023. “CONCERNING QUANTITATIVE TESTING OF EXTERNAL CONSUMER DATA AND INFORMATION SOURCES, ALGORITHMS, AND PREDICTIVE MODELS USED FOR LIFE INSURANCE UNDERWRITING FOR UNFAIRLY DISCRIMINATORY OUTCOMES.” https://drive.google.com/file/d/1BMFuRKbh39Q7YckPqrhrCRuWp29vJ44O/view .

Huang, Fei, and Giles Hooker. 2026. “Fairness Testing for Algorithmic Pricing.”

Larson, Jeff, Julia Angwin, Lauren Kirchner, and Surya Mattu. 2017. “Minority Neighborhoods Pay Higher Car Insurance Premiums Than White Areas with the Same Risk.” ProPublica and Consumer Reports. https://www.propublica.org/article/minority-neighborhoods-higher-car-insurance-premiums-white-areas-same-risk.

MacKinnon, James G, and Halbert White. 1985. “Some Heteroskedasticity-Consistent Covariance Matrix Estimators with Improved Finite Sample Properties.” Journal of Econometrics 29 (3): 305–25.

New York State Department of Financial Services. 2024. “Proposed Insurance Circular Letter: January 17, 2024.” https://www.dfs.ny.gov/industry_guidance/circular_letters/cl2024_nn_proposed.

Schuirmann, Donald J. 1987. “A Comparison of the Two One-Sided Tests Procedure and the Power Approach for Assessing the Equivalence of Average Bioavailability.” Journal of Pharmacokinetics and Biopharmaceutics 15 (6): 657–80.

Xin, Xi, Giles Hooker, and Fei Huang. 2026. “How Proxy Race Distorts Regression-Based Fairness Audits.” https://arxiv.org/abs/2603.17106.

Background

How to Read This Case Study

Audit protocol

Case Study Setup

Data and audit setting

Audit parameters

Import packages and define audit parameters

Supporting functions

Data preparation

Notation

Conditional Demographic Parity Audit

Audit model

Classical versus HC3 standard errors

CDP decision rule using TOST

Proxy Discrimination Audit

Coefficient-shift idea

Standard versus corrected standard errors

Appendix: Supporting Materials

Data

Further reading

References